Many people want to use the Web to publish information they create in QuarkXPress™ format. Consequently, there are many ways to do so. The most efficient way, however, is to separate the content of such QuarkXPress documents from the documents themselves, and store that content in a structured format such as XML. Then you can reuse the content not only on the Web, but in other formats as well: print, CD-ROM, you name it.
Avenue.quark™ software is designed to make it easy for you to extract your QuarkXPress content and store it in XML format.
Avenue.quark lets you extract the content of QuarkXPress documents and store that content in XML format. You can then easily reuse the content in a variety of ways, including on the Web. This section covers the process and definitions in brief; more detailed descriptions follow in subsequent sections.
What is content?
Content is the information that makes your documents valuable. For example, the content of a magazine may include articles, photographs, interviews, and diagrams.
Content can also be defined by what it is not. For example, headers, footers, and "Continued on page x" notes are generally not considered to be part of a magazine's content. Rather, they're part of the magazine's presentation: aspects of the magazine that are desirable only when the magazine is being presented in printed format. Presentation may change depending on the medium through which information is published, but content generally stays the same.
Avenue.quark lets you separate content from presentation by extracting that content from your QuarkXPress documents and storing it in XML format. Then you can reuse that content with different presentations: in print, on the Web, on CD-ROM, and so forth. You need only adjust the presentation for each setting.
What is XML?
XML stands for Extensible Markup Language. XML is a way for you to specify the structure of content and label the pieces of that content in a meaningful way.
Labeling content
Why do we need to label content? Because although we can pick up a magazine and know that a particular line of text is a headline, such distinctions aren't so easy for a computer. XML lets you "tag" information in a way that computers can understand. And once a computer understands that a particular line of text is a headline, it can automatically format that line as a headline.
To label a piece of content in XML, you insert an opening XML tag before the content and a closing XML tag after the content, like so:
<headline>Internet Grows by 400%</headline>
As you can see, an opening tag consists of an element name between a < and a >. A closing tag is the same, with a / after the <. Here, we've "tagged" the text "Internet Grows by 400%" as a headline by putting it between opening and closing <headline> tags.
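To see how a computer "understands" the label, here is a minimal sketch that reads the tagged headline with Python's standard xml.etree.ElementTree module. The choice of Python is an assumption made for illustration; avenue.quark itself is not involved here.

```python
import xml.etree.ElementTree as ET

# Parse the tagged headline from the example above.
element = ET.fromstring("<headline>Internet Grows by 400%</headline>")

label = element.tag      # the element name found inside the tags
content = element.text   # the text between the opening and closing tags

print(label, "->", content)
```

Once a program knows that the label is "headline," it can decide how to format the content for each medium.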
Identifying structure
We know that a news story generally consists of a headline, a byline, body text, and some photos or diagrams with captions. However, computers don't know such things until you tell them.
XML lets you describe the structure of your documents with DTDs, or document type definitions. A DTD specifies that the information in a document will use a particular set of tags and follow a particular set of structure rules. For example, a DTD for a news story might specify that:
By consistently adhering to the rules of a DTD, an organization can ensure that its documents are always structured predictably and consistently. This makes it much easier for organizations to move content from one medium to another (for example, from print to the Web, or vice versa).
Avenue.quark requires the use of DTDs. For information on creating and adapting DTDs, see "Working With DTDs" and "Industry-Standard DTDs" in this chapter.
A "neutral" format
XML is a "neutral" format in that it contains no information about formatting. Because of this, it can be used with a wide variety of applications, which can apply different kinds of formatting when the content is presented through different kinds of media.
For a more detailed discussion of XML, see "Understanding XML" in this chapter.
What can I do with content stored in XML format?
Once you've extracted the content from a QuarkXPress document, you can use that content in a variety of ways. For example, you can dynamically translate XML-tagged content into HTML format and serve it on the Web. This method of converting QuarkXPress content to HTML is superior to simple HTML export because it lets you easily format, reformat, and reorganize the content.
Now that you have a general idea of what avenue.quark is and how it works, let's take a look at the nuts and bolts, beginning with a look at XML.
XML (Extensible Markup Language) is a way of specifying the structure of documents and labeling specific pieces of content with tags. XML's structural controls let you make sure that all of the necessary parts of that document are present and occur in the proper order. Labeling content makes it easy for other applications to use or display that content.
Before we consider how XML accomplishes all of this, though, let's talk about why it's necessary.
The problems XML solves
XML was derived from an older and more complicated markup language named SGML (Standard Generalized Markup Language). XML was created to solve a variety of related problems, some of which were originally solved by SGML, others of which are unique.
Assigning structure and labels to information
XML is sometimes referred to as a "meta-language" because it lets you define customized markup languages for specific uses. The way you do this is by creating a DTD (document type definition). A DTD specifies what kind of information may go into a document, how the various parts of the document should be tagged (labeled), the order in which the parts should occur, and how many of each part are allowed. A document is considered "valid" according to a given DTD only if it follows that DTD's rules.
DTDs let you enforce the structure of documents. If you have a document's DTD, you know what kind of information to expect when you open that document. DTDs also make the information in XML documents easy for computers to process; if a computer can "understand" a DTD, it can "understand" the information in any XML document that adheres to that DTD. For example, given a document's DTD, a computer program might let you search through every occurrence of a particular type of information (such as company name) in that document, or produce an HTML page that lists all occurrences of that type of information (for example, a list of company names).
Specialized DTDs have already been developed for chemistry, mathematics, technical documentation, and even fictional works. Potential applications include workflow control, software specification, and just about any other field of endeavor that involves the exchange of structured information.
Unlike SGML, XML lets you create "well-formed" documents: that is, documents that follow the rules of XML but do not follow a particular DTD. However, it's difficult to maintain consistency among documents if you do not have a standard, and for this reason, avenue.quark requires you to use DTDs.
Making sense of HTML
HTML has proved to be a powerful and versatile format for displaying information on the World Wide Web. However, it has two major shortcomings: it describes only the formatting of data, not its meaning, and you cannot create new HTML tags.
XML solves both of these problems. If you use XML to label data in an XML document, you can then base the HTML formatting on those labels. So, for example, say you have an XML document that includes a list of companies and some information about each of those companies. To transform this list into an HTML Web page in which every company name is bold, you simply use an XML-to-HTML converter, and instruct the converter to bold every line that's tagged as a <companyName>. This means you no longer have to go through and format each company name and address manually. The potential time savings for Web site creators are huge.
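The conversion described above can be sketched in a few lines of Python using only the standard library. This is an illustration, not the converter the text has in mind: the <companyList>, <company>, and <city> tags, the sample company names, and the companies_to_html function are all assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical company list; only <companyName> comes from the text above.
COMPANY_XML = """<companyList>
  <company><companyName>Acme &amp; Sons</companyName><city>Denver</city></company>
  <company><companyName>Globex</companyName><city>Boulder</city></company>
</companyList>"""

def companies_to_html(xml_text):
    """Return an HTML fragment in which every <companyName> is bold."""
    root = ET.fromstring(xml_text)
    lines = []
    for name in root.iter("companyName"):
        # escape() keeps characters such as & safe in the HTML output.
        lines.append("<p><b>" + escape(name.text or "") + "</b></p>")
    return "\n".join(lines)

html_page = companies_to_html(COMPANY_XML)
print(html_page)
```

Because the formatting rule is attached to the label rather than to each line, changing bold to italic means editing one line of the converter, not every company name.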
Information exchange
Because computer programs have been developed by many different people and organizations for many different uses, they store information in many different formats. For example, two different companies may store their customer information in two completely different formats, even though the customer information stored by the companies (name, address, phone number, and so forth) is basically identical.
XML solves this kind of problem by providing a standardized, nonproprietary format for the transfer of information between applications. XML was developed, refined, and approved by a group of professionals from different industries working together as part of the World Wide Web Consortium (the W3C). The specification is available to anyone who wants to use it (see www.w3.org), and many organizations and industries already do.
If two companies agree to use software that can convert their records to XML with an agreed-upon DTD, they can exchange those records at will, with no risk of data loss due to incompatible formats. For more information on DTDs and information exchange, see "Industry-Standard DTDs" in this chapter.
For a more in-depth discussion of XML, see "Working with XML" in this chapter.
An XML document contains structured data that has been broken down into "elements," each of which is described with XML tags.
XML elements and XML tags
An XML element contains a nugget of information, such as a company name, a headline, or a part number. You create an element by putting a piece of information between two XML tags: an opening tag containing the element's name between a less-than symbol and a greater-than symbol, and a closing tag that is the same except for the inclusion of a slash (/) before the element name. For example, a tagged "name" element might look like this:
<name>Gertrude</name>
It's important to understand the difference between an XML element and an XML tag. An XML tag is simply the label that is attached to a chunk of information; an XML element includes both the chunk of information and the tags that surround it.
XML tags let you describe and add structure to the data they surround. For instance, the following introductory paragraph is tagged with an <introduction> tag:
<introduction>
Within the <introduction> element, you can tag other sub-elements to add more structure to your document:
<introduction>
Syntax is important for XML tags. Unlike HTML tags, they are case-sensitive; a <Name> tag is different from a <name> tag, which is different from a <NAME> tag. Each XML tag name must begin with a letter or an underscore (_); subsequent characters in the name may be letters, underscores, numbers, hyphens and periods, but not spaces or tabs. For example, the XML tag name <_.dir> is a correctly formed tag name, but the names <_ dir> and <.dir> are not. The <_ dir> tag name is incorrect because it contains white space (a tab or space) after the underscore. The <.dir> tag name is incorrect because it begins with a period instead of an underscore or letter.
It's useful to be aware of the difference between "elements" and "element types." An element type can be thought of as a specific tag name that can be applied to data; an element is a piece of data and the tags that surround it. For example, a document containing a list of names and addresses might have only two element types, <name> and <address>, but hundreds of elements that use those tags.
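The following sketch, written in Python with the standard xml.etree.ElementTree module, counts element types and elements in a small address list and then checks that tag names really are case-sensitive. The <addressBook> container and the sample names and addresses are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical address list: two element types, six elements.
ADDRESS_BOOK = """<addressBook>
  <name>Gertrude</name><address>12 Elm St.</address>
  <name>Alice</name><address>34 Oak Ave.</address>
  <name>Leo</name><address>56 Pine Rd.</address>
</addressBook>"""

root = ET.fromstring(ADDRESS_BOOK)

# Count the elements that use each tag name (the root container is skipped).
counts = Counter(child.tag for child in root)
element_types = sorted(counts)          # the distinct tag names
total_elements = sum(counts.values())   # every tagged piece of data

# A <Name> opening tag is not closed by a </name> closing tag.
try:
    ET.fromstring("<Name>Gertrude</name>")
    case_sensitive = False
except ET.ParseError:
    case_sensitive = True

print(element_types, total_elements, case_sensitive)
```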
XML Attributes
Let's say you're working with elements tagged as <car>, and you want to be able to specify additional information about each <car> element you create. For example, say you want to be able to designate a specific <car> element as not just a car, but a fast, red, expensive car.
There are several ways you could do this. One way involves creating additional element types, like so:
<car>
Another (and perhaps "cleaner") way to do it is with an XML feature called attributes. Attributes are designed to provide information about an element. They are included within an element's start tag, so there is never any doubt about which element they are related to.
An attribute consists of an attribute name, followed by an equals sign, followed by an attribute value between quotation marks. For example, the following single element uses three attributes to provide the same information as the example above:
<car speed="fast" color="red" cost="expensive">
Attributes are useful for several reasons. For example, they make it easy to search a document and generate a list of all the <car> elements that contain the value "expensive" in the cost attribute. They can also be useful in conjunction with empty elements; see the next section for details.
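Here is a minimal sketch of the attribute search mentioned above, using Python's standard xml.etree.ElementTree module. The <garage> container and the car names are assumptions; the attribute names and values follow the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of cars; <garage> and the car names are made up.
GARAGE = """<garage>
  <car speed="fast" color="red" cost="expensive">Roadster</car>
  <car speed="slow" color="blue" cost="cheap">Hatchback</car>
  <car speed="fast" color="black" cost="expensive">Coupe</car>
</garage>"""

root = ET.fromstring(GARAGE)

# Select every <car> whose cost attribute has the value "expensive".
expensive = [car.text for car in root.findall("car[@cost='expensive']")]

print(expensive)
```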
Empty elements
Empty elements include a start tag and an end tag, and do not surround any data, like so:
<IDnumber></IDnumber>
Since empty elements have no content between their tags, the starting and closing tags are often combined, like so:
<IDnumber/>
You can use attributes along with empty elements to reference URLs or externally-stored files. For example, the following empty element could be used (with an appropriate XML interpreter) to display a picture of a car:
<carPicture URL="1995GeoMetro.jpg"/>
Note that simply adding an attribute named "URL" to an element does not guarantee that the URL will be accessed when the XML file is processed. The application that processes the file must know what to do with the URL attribute.
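As a quick check, this sketch (Python, standard library only) parses both spellings of the empty element and reads the URL attribute from the picture element. Note that the parser only hands the attribute value back as text; fetching the picture would be up to the application.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<IDnumber></IDnumber>")
short_form = ET.fromstring("<IDnumber/>")

# Both spellings produce an element with no text and no child elements.
same = (long_form.text, len(long_form)) == (short_form.text, len(short_form))

picture = ET.fromstring('<carPicture URL="1995GeoMetro.jpg"/>')
url = picture.get("URL")   # just a string; nothing is downloaded

print(same, url)
```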
Comments
Just as in HTML, you can include comments in an XML file. Comments are bracketed by <!-- and -->, and are essentially ignored by XML processors. So, for example, to insert a comment about the status of an <address> element, you could do the following:
<Address>
Processing instructions
In HTML, comments are commonly used to contain special commands for browsers and other HTML processors. In an effort to restrict XML comments to being just that (comments), the authors of the XML specification have included a method for inserting customized commands in XML files and DTDs. Such customized commands, called processing instructions (or "PIs"), are simply enclosed between a <? and a ?>. They begin with an application name, followed by a space and any information that might be of interest to the named application. Processing instructions can be used anywhere that comments can appear.
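The sketch below shows how a program might tell comments and processing instructions apart. It uses Python's standard xml.dom.minidom module; the "layoutTool" application name, its keep-together instruction, and the sample comment are hypothetical.

```python
from xml.dom import minidom

# Hypothetical fragment containing one comment and one processing instruction.
SAMPLE = """<address>
  <!-- Verify this address before printing -->
  <?layoutTool keep-together="yes"?>
  <city>Denver</city>
</address>"""

doc = minidom.parseString(SAMPLE)
comments = []
instructions = []
for node in doc.documentElement.childNodes:
    if node.nodeType == node.COMMENT_NODE:
        comments.append(node.data.strip())
    elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        # target is the application name; data is everything after it.
        instructions.append((node.target, node.data))

print(comments)
print(instructions)
```

An application that does not recognize the "layoutTool" name can simply skip the instruction.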
XML declaration
Each XML document should begin with an XML declaration. Like a processing instruction, an XML declaration is enclosed between a <? and a ?>. Here's an example of an XML declaration:
<?xml version="1.0" standalone="yes"?>
The "version" attribute declares that this document adheres to the rules of XML 1.0. The "standalone" attribute indicates that all markup declarations needed to process this XML document are included in the document.
Entity references
An entity reference is a word that serves as shorthand for a character, string, or file. For example, by using the &lt; entity reference to represent a less than character (<) in the content of an XML document, you can avoid confusing the XML parser (which would otherwise erroneously read the "<" character as the beginning of a tag). For more information on entity references, see "Entity References" in the "Working with DTDs" section of this chapter.
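A small sketch in Python (standard library only) shows both sides of this: the predefined less-than reference is replaced with the character it stands for, while a bare less-than character in content is rejected. The <comparison> element is an assumption.

```python
import xml.etree.ElementTree as ET

# The entity reference is replaced by the character it stands for.
escaped = ET.fromstring("<comparison>3 &lt; 5</comparison>")
text = escaped.text

# A bare less-than character looks like the start of a tag and is rejected.
try:
    ET.fromstring("<comparison>3 < 5</comparison>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

print(text, raw_accepted)
```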
Well-formed XML
For an XML document to be well-formed, it should begin with an XML declaration and have a root element that contains all of its other elements (<article> in the example below). Well-formed XML also requires every element in the document to have a corresponding end tag. The following is an example of a well-formed XML document:
<?xml version="1.0" standalone="yes"?>
Valid XML
A well-formed XML document can be limited in its usefulness unless it is also valid. An XML document is considered to be valid when it adheres to the specifications of a specific DTD. For more information about DTDs and validating XML documents, see "Working With DTDs" in this chapter.
XML processors
An XML processor is, quite simply, a program that reads an XML file and does something with it. There are various kinds of XML processors. An XML processor might convert an XML file into an HTML Web page, a PDF file, or a PostScript file. Or it might read the XML file's content out loud, or convert the content to braille. An XML processor might even be used to copy structured XML content into a database.
XML parsers
An XML parser recognizes the rules of XML and checks to see if an XML document is well-formed. However, an XML parser does not necessarily check to see if an XML document is valid according to its DTD; this requires a validating XML parser (see below).
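Here is a minimal sketch of a well-formedness check built on a non-validating parser, the one in Python's standard xml.etree.ElementTree module. The is_well_formed function and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = ('<?xml version="1.0" standalone="yes"?>'
        "<article><headline>Internet Grows by 400%</headline></article>")
missing_end_tag = "<article><headline>Internet Grows by 400%</article>"
two_roots = "<article/><article/>"

for sample in (good, missing_end_tag, two_roots):
    print(is_well_formed(sample))
```

This parser never compares the document to a DTD, so a True result says nothing about validity.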
Validating XML parsers
Validating XML parsers compare an XML document to a DTD and verify whether the document conforms to the DTD's rules. A good validating parser will also provide constructive feedback about any problems it finds in the XML file. For more information about XML parsers, see "Working With DTDs" in this chapter.
For a quick reference to XML features and conventions, see Appendix A, "XML Quick Reference," in Chapter 7, "Appendices."
A DTD (document type definition) specifies which elements an XML file may contain and how those elements must be structured. XML documents don't necessarily have to have a corresponding DTD; as long as an XML file follows basic XML syntax, it's considered to be "well-formed" and can be read by an XML-savvy application. But only if an XML file adheres to a particular DTD can it be considered "valid."
DTDs are important because they provide a reliable, well-documented structure for XML documents. Without DTDs, two organizations that work together might decide to structure and tag their XML documents in entirely different ways; thus, their data stores would remain incompatible even after they both have made the transition to XML. However, if both organizations have the same DTD (perhaps a DTD they developed together, or a DTD that has become standard in their industry), they can exchange information easily and predictably.
External and internal DTDs
There are two kinds of DTDs: external DTDs and internal DTDs.
Technically, a DTD consists of the list of markup declarations (element declarations, attribute declarations, entities, notations, processing instructions, and comments) that is referenced by a DOCTYPE declaration. What this document refers to as "external DTDs" and "internal DTDs" are not technically complete DTDs; however, it is convenient and fairly common to refer to them as such.
External DTDs
An external DTD (or external subset) is a file containing a list of markup declarations. External DTDs are easy to share among XML documents and between organizations. To use an external DTD in an XML file, you simply reference it at the beginning of the XML file, like so:
<?xml version="1.0" standalone="no"?>
Internal DTDs
An internal DTD (or internal subset) is actually included in the XML file it describes. To use an internal DTD in an XML file, you simply add it to the beginning of the XML file, like so:
<?xml version="1.0" standalone="yes"?>
If a document uses an external DTD (or any other sort of external entity), the "standalone" attribute in the first line must be set to "no." For more information, see "Using entity references" in this section.
Combining internal and external DTDs
In a given XML document, you can specify an external DTD, then add to or override that DTD with an internal DTD. Here's how such an XML document might look.
<?xml version="1.0" standalone="no"?>
Planning a DTD
You probably don't want to just sit down and start writing a DTD; it requires considerable planning if you want to do it right.
Before you begin the process of creating your own DTD, you may want to consider using an industry-standard DTD. For more information on this option, see "Industry-Standard DTDs" in this chapter.
A good way to begin is to figure out what exactly you want your DTD to do. First, figure out which elements you want to use. If you want to use elements such as <address>, think about whether you want to subdivide those elements into subelements such as <streetAddress>, <unitNumber>, <city>, <state>, and <ZIPcode>. (Give such subdivisions serious consideration if there's any chance that you may one day transfer the contents of your XML files into a database.)
So much for the easy part. Next, you need to figure out the relationships between all of these elements. A DTD can specify which elements are allowed, what order they must be in, and which (and how many) subelements they may contain. It can specify which other elements can contain a given element, and it can specify whether a given element must contain data or not.
Elliotte Rusty Harold, in XML: Extensible Markup Language, recommends using a table to help you figure out the relationships between the various elements in your DTD. The table should have the following columns (the data in the columns is provided as an example only):
Element Name | Must Contain | May Contain | Must Be Contained By |
<address> | <streetAddress>, <city>, <state>, <postalCode> | <careOf> | <personalData> |
<streetAddress> | | | <address> |
Each row in the table should represent an element that you want to use in your DTD.
Using a DTD
Like an XML file, a DTD consists of plain text. An XML file may use no DTD, an external DTD, an internal DTD, or both an external and an internal DTD.
Regardless of which type of DTD an XML document uses, it must reference or include that DTD in its prologue (opening section), just after the XML declaration and before the body of the XML document. The DTD section begins with "<!DOCTYPE rootname [" and ends with "]>". Here, for example, is a complete XML document containing a complete DTD (in bold):
<?xml version="1.0" standalone="yes"?>
Let's break that down a little bit:
As you can see, each element type definition specifies both the element's name and the kind of data that element may contain. If you wanted to change the element type definition for <message> so that it could contain text and only text (that is, no other elements), you could do so by changing the "ANY" keyword to "(#PCDATA)", like so:
<?xml version="1.0" standalone="yes"?>
But you probably wouldn't want to do this, because that would mean your document's root element could contain only parsed character data (see note below); you wouldn't be able to add more elements to subdivide the information.
"PCDATA" stands for "parsed character data": that is, text that may include entity references, comments, and processing instructions.
Let's take a look at a more realistic DTD. The following internal DTD defines a structure for a directory of branch offices:
<!-- Root element is <branchOfficeDirectory> -->
Note that we've inserted a comment indicating that <branchOfficeDirectory> is the root element of the DTD. We did this because a DTD can't explicitly designate a root element; specifying the root element is technically the job of the !DOCTYPE line in an XML document. But it's a good idea to specify root elements with a comment so users of the DTD can see what they are.
Some DTDs may contain more than one element that can serve as a root element. For example, you can write a DTD that contains definitions for both white paper documents and FAQ documents, then use that DTD to create both kinds of document simply by specifying <whitePaper> or <FAQ> as the root element of each XML file.
The remaining lines in the DTD declare elements for each office's street address, city, state, postal code, country, phone number, fax number, and e-mail address.
Controlling tag selection and order
The above DTD might work just fine for you, but it doesn't really take advantage of XML's features. For example, it doesn't specify any means of indicating which address elements go with which offices, and it doesn't specify any particular order for the information. So you could create a document that lists all the various cities, streets, phone numbers, and so forth in random order, and it would still be valid according to this DTD.
To give the DTD a meaningful structure, you need a way to tie all of the component elements for each listing together and put them in a particular order. One way to do this is by creating a container element to contain the relevant information for one office (we'll call it <branchOffice>), and then specifying which subelements must make up that container element and the order in which they must fall. We can do all of this by adding one line to the DTD (in bold):
<!-- Root element is <branchOfficeDirectory> -->
What this new element says is, "If the document contains an element named <branchOffice>, that element must contain exactly one of each of the following elements, in this order, and nothing else."
What if some of your branch offices have more than one line in their street address? You can allow one or more occurrences of any element in a list of subelements by adding a + to the end of the element's name. For example, to allow one or more <streetAddress> elements in our <branchOffice> element, we could do the following:
<!ELEMENT branchOffice (streetAddress+, city, state, postalCode, country, phone, fax, eMail)>
What if some of your branch offices don't have fax machines? And what if some of them have more than one? To specify zero or more occurrences of an element, tack a * onto the end of the element name, like so:
<!ELEMENT branchOffice (streetAddress+, city, state, postalCode, country, phone, fax*, eMail)>
And what if some of your branch offices are in a country where there is no such thing as a postal code? To allow zero or one occurrence of a given element, add a question mark to the end of the element's name, like this:
<!ELEMENT branchOffice (streetAddress+, city, state, postalCode?, country, phone, fax*, eMail)>
What is called a "state" in the United States may have a different name elsewhere. Canada, for example, is divided into provinces. If you have offices in both the United States and Canada, you may want to provide the option of using a <state> element or a <province> element. You can do this by putting the two options between a pair of parentheses, separated by a | character, like so:
<!ELEMENT branchOffice (streetAddress+, city, (state|province), postalCode?, country, phone, fax*, eMail)>
Lastly, we can make sure that a <branchOfficeDirectory> consists of nothing but <branchOffice> listings by changing the definition of <branchOfficeDirectory> from ANY to (branchOffice*). Here's our final product:
<!-- Root element is <branchOfficeDirectory> -->
To review:
Symbol | Meaning |
None | Exactly one |
+ | One or more |
* | Zero or more |
? | Zero or one |
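One way to get a feel for these symbols is to notice that they behave much like the same symbols in a regular expression. The sketch below, in Python with the standard library, rewrites the <branchOffice> content model as a regular expression over child element names. It is an illustration of the occurrence rules only, not a validating parser; the matches_model function and the sample offices are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# The <branchOffice> content model as a pattern over child tag names.
# Each name is followed by one space so that names cannot run together.
BRANCH_OFFICE_MODEL = re.compile(
    r"(streetAddress )+"     # +  one or more
    r"city "                 #    exactly one
    r"(state |province )"    # |  one or the other
    r"(postalCode )?"        # ?  zero or one
    r"country phone "        #    exactly one of each, in this order
    r"(fax )*"               # *  zero or more
    r"eMail "                #    exactly one
)

def matches_model(xml_text):
    """Check the order and count of the children of one <branchOffice>."""
    office = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in office)
    return BRANCH_OFFICE_MODEL.fullmatch(child_names) is not None

toronto = ("<branchOffice><streetAddress>1 Main St.</streetAddress>"
           "<streetAddress>Suite 2</streetAddress><city>Toronto</city>"
           "<province>ON</province><country>Canada</country>"
           "<phone>555-0100</phone><eMail>to@example.com</eMail></branchOffice>")
no_street = ("<branchOffice><city>Toronto</city><province>ON</province>"
             "<country>Canada</country><phone>555-0100</phone>"
             "<eMail>to@example.com</eMail></branchOffice>")

print(matches_model(toronto), matches_model(no_street))
```

The first office passes even though it has two street address lines, no postal code, and no fax number; the second fails because at least one <streetAddress> is required.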
The special symbols can be used in conjunction with the parentheses to create complex element type declarations such as the following DTD, designed to list contact information on a day-by-day basis:
<!-- Root element is <contactSchedule> -->
On any given day, the subject of this list might be in the office, at home, or away on a business trip. Thus, each <contactInfo> element may include one of the following lists of information, with subelements in the order given:
Allowing empty tags
You might want to write your XML documents in such a way that they are easily translated into HTML format. If so, you might want to include tags such as <BR> and <HR> in your XML file, with an eye toward translating them verbatim into the HTML file.
You can't really do this per se in XML, because every element must have a closing tag. However, you can create what are called EMPTY tags and let an XML-to-HTML converter worry about translating them into the proper output tags. For example, to allow the creation of <HR> tags, you would include the following line in the DTD:
<!ELEMENT HR EMPTY>
To use this tag, you could insert a line like the following into your XML file:
<HR/>
You can't include it as "<HR>", because every XML tag must either have a closing tag or end with a forward slash, but that's okay; an XML-to-HTML converter should convert the <HR/> to an <HR>.
EMPTY tags are often used to contain images. The URL of the image data is stored in one of the EMPTY tag's attributes. For more information about attributes, see "Defining attributes" in this section.
Using character references
A character reference is a way of representing Unicode characters in parsed character data. The syntax for character references is as follows:
&#UnicodeValueOfCharacter;
For example, to insert the euro monetary sign before the number 5000 in an <amount> element, you could do the following:
<amount>&#8364;5000</amount>
It is the job of the XML processor to substitute the appropriate Unicode characters for character references at output.
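The substitution can be observed with any XML parser. This sketch uses Python's standard xml.etree.ElementTree module and assumes the decimal value 8364 for the euro sign (U+20AC); the hexadecimal spelling shown alongside it is part of XML's character reference syntax as well.

```python
import xml.etree.ElementTree as ET

# Decimal character reference for the euro sign.
decimal_form = ET.fromstring("<amount>&#8364;5000</amount>").text

# The same character written as a hexadecimal character reference.
hex_form = ET.fromstring("<amount>&#x20AC;5000</amount>").text

print(decimal_form == hex_form)
```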
Using entity references
An entity reference is a little bit of text that represents something else, such as a character, a string of text, an externally-stored XML file, or a binary file (such as a picture or sound file). There are five kinds of entity references:
These entity reference types are described in detail below.
What's the difference between an entity and an entity reference? An entity reference is the shorthand you insert into an XML document to represent an entity. An entity is the content that replaces the entity reference when the XML is processed.
Parsed internal entity references
A parsed internal entity reference is basically shorthand for a string of characters that you plan to reuse often within a given XML document. The format for declaring a parsed internal entity reference in a DTD is as follows:
<!ENTITY entityName "replacement text">
For example, say you're building an XML document that contains a list of employees and some information about each of them. Each employee's record needs to contain the phrase "Years with the company:", followed by a number. Rather than typing the phrase over and over again manually, you can create a parsed internal entity reference for the phrase as part of the document's DTD, as follows:
<!-- Root element is <employeeListing> -->
To use the "yrs" parsed internal entity reference in the XML document, we might do the following:
<employee>
When this <employee> element is processed, the XML processor will expand the parsed internal entity reference, resulting in the following XML:
<employee>
There are five predefined parsed internal entity references available in XML. Unlike all other parsed internal entity references, these are part of the XML specification and do not need to be declared.
Character | Entity reference |
< | &lt; |
> | &gt; |
& | &amp; |
" | &quot; |
' | &apos; |
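To watch both kinds of parsed internal entity reference being expanded, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The "yrs" entity follows the employee example; the one-element DTD, the employee name, and the numbers are assumptions.

```python
import xml.etree.ElementTree as ET

# A small document whose internal DTD declares the "yrs" entity.
EMPLOYEE_DOC = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE employee [
<!ELEMENT employee ANY>
<!ENTITY yrs "Years with the company:">
]>
<employee>Gertrude. &yrs; 12. Rating &gt; 4.</employee>"""

employee = ET.fromstring(EMPLOYEE_DOC)

# The declared entity and the predefined one are both replaced on parsing.
expanded = employee.text

print(expanded)
```

The "yrs" reference works only because the DTD declares it; the greater-than reference needs no declaration.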
For example, say you need to use a greater than sign (>) in your XML document's content. As you know, the greater than sign indicates the closing of a tag in XML. In order to avoid confusing the XML processor, you can substitute "&gt;" for the greater than sign wherever it occurs. For example, to express "the whole > the sum of its parts" in an XML file, you could do the following:
<platitude>
There are three restrictions to using parsed internal entity references:
Parsed external entity references
A parsed external entity reference lets you include content stored in an externally-located text file. Parsed external entity references should be declared in the DTD in one of the following ways:
<!ENTITY entityName SYSTEM "URL of file to be referenced">
The first example lets you use the URL of a particular file. The second example lets you use the name of a resource, which may in turn point to a URL; the URL that follows is a "backup" URL, to be used only if the name cannot be resolved.
Parsed external entity references can be used to share content between XML files. For example, here's a complete sample XML document in which the content is stored in a text file named "myfile.txt" on the Quark™ Web site:
<?xml version="1.0" standalone="no"?>
This is handy because it lets you also use the content in "myfile.txt" in other XML files.
If a document uses external entity references, you should set the "standalone" attribute in the XML declaration to "no."
Unparsed external entity references
What if you want to reference a picture, spreadsheet, sound file, HTML file, or other non-XML file in an XML document? You can't use a parsed external entity reference because the XML processor will try to parse your binary file, and that will lead to errors.
To get around this problem, you can supply a notation at the end of the external entity reference. A notation simply tells the XML processor not to parse the target file, and indicates what kind of file it is. The format for declaring a notation in a DTD is as follows:
<!NOTATION notationName SYSTEM "ApplicationName">
For example, to draw a connection between JPEG files and Adobe® Photoshop®, you could add a notation such as this one to the DTD:
<!NOTATION jpeg SYSTEM "Adobe Photoshop">
To utilize a notation in an external entity reference declaration, use the following syntax:
<!ENTITY entityName SYSTEM "URL" NDATA notationName>
For example, to create an entity named "myPicture" that points to a URL containing a JPEG file, you could use the following tag:
<!ENTITY myPicture SYSTEM "http://www.quark.com/picture.jpg" NDATA jpeg>
You can also use the PUBLIC syntax with notations, specifying first a public notation name and then a backup notation URL:
<!ENTITY myPicture PUBLIC "-//Quark//Fictional JPEG Name" "http://www.quark.com/xml/picture.jpg" NDATA jpeg>
Other ways to reference external files: Unparsed entity references are not the only way to reference external files in XML files without specifying that they must be parsed. You can also store the URL of such a file as plain element or attribute content. The first example below references the URL of a picture file as element content, and the second example references the same URL as attribute content:
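A minimal sketch of these two forms follows. The <picture> element and the "src" attribute are hypothetical names chosen for illustration, and the two fragments are alternatives rather than parts of a single document.

```xml
<!-- URL stored as element content (hypothetical <picture> element) -->
<picture>http://www.quark.com/picture.jpg</picture>

<!-- The same URL stored as attribute content (hypothetical "src" attribute) -->
<picture src="http://www.quark.com/picture.jpg"/>
```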
Whether you choose to use unparsed entities, elements, or attributes to reference non-XML files is up to you. Any of these methods will work equally well, as long as the application that processes the XML knows that the URLs are URLs.
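For comparison, here is a sketch of the unparsed entity approach in a complete document. The notation and entity declarations repeat the ones shown earlier; the <catalog> and <photo> elements and the "source" attribute are assumptions. Note that an unparsed entity is used by naming it in an attribute of type ENTITY, not by writing an ampersand reference in the content.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE catalog [
  <!NOTATION jpeg SYSTEM "Adobe Photoshop">
  <!ENTITY myPicture SYSTEM "http://www.quark.com/picture.jpg" NDATA jpeg>

  <!-- Hypothetical elements for illustration -->
  <!ELEMENT catalog (photo)>
  <!ELEMENT photo EMPTY>
  <!-- An ENTITY-type attribute takes the name of an unparsed entity as its value -->
  <!ATTLIST photo source ENTITY #REQUIRED>
]>
<catalog>
  <photo source="myPicture"/>
</catalog>
```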
Internal parameter entity references
If you want to create an entity reference that is used only within a particular DTD, you must create a parameter entity reference. An internal parameter entity reference is very similar to a parsed internal entity reference, except it begins with a % instead of a &, both in its declaration and when you use it:
<!ENTITY % entityName "entity definition">
You can use internal parameter entity references in a DTD's external subset in the same way you use parsed internal entities in an XML document. For example, here we use an internal parameter entity reference to create a shorthand way of referring to a content model that describes a person's name:
<!ENTITY % name "(firstName, lastName)">
This is useful because it makes it easy for you to change the definition of all types of names at one time. So, for example, if you decided you wanted to also store middle names for all employers, employees, and customers, you could just change the internal parameter entity declaration above to:
<!ENTITY % name "(firstName, middleName, lastName)">
Note that this kind of internal parameter entity reference can be used only in a DTD's external subset.
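To see how the shorthand is applied, consider the following sketch of an external DTD subset. The <employer>, <employee>, and <customer> declarations are assumptions based on the names mentioned above; each one picks up whatever content model the "name" entity currently defines.

```xml
<!-- Sketch of an external DTD subset; the element declarations are assumptions -->
<!ENTITY % name "(firstName, lastName)">

<!-- Each %name; reference expands to (firstName, lastName) -->
<!ELEMENT employer %name;>
<!ELEMENT employee %name;>
<!ELEMENT customer %name;>

<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
```

Changing the single entity declaration at the top would change the content model of all three elements at once. If you added middleName to it, you would also need to declare a <middleName> element.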
External parameter entity references
An external parameter entity reference is very similar to a parsed external entity reference, except it begins with a % instead of a &, both in its declaration and when you use it. For example, the following two lines (from an XML document's internal subset) first create an entity reference pointing to an external DTD called "standardHeader.dtd" and then include that external DTD in the XML file:
<!ENTITY % standardHeader SYSTEM "standardHeader.dtd">
%standardHeader;
For more information on this usage, see "Using public DTDs" in this section.
Parameter entity references can be used only within a DTD.
Internal and external parameter entity references can be used together. For example, you can use internal parameter entity references in the internal subset to reference entities that are defined in the external subset. This is useful because it lets you change the definition of an entity without having to change the internal subset of XML files that use the entity. So, for example, you could include the following declaration in a text file named "entitiesFile.txt":
<!ENTITY % nameEntity "<!ELEMENT name (firstName, lastName)>">
And then, in the internal subset of XML documents, include the following:
<!-- Include the file containing the above entity -->
This would enable you to change the definition of the nameEntity entity reference in any number of XML documents simply by changing it in the "entitiesFile.txt" file.
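Put together, a document that relies on "entitiesFile.txt" might look something like the following sketch. The parameter entity name "entitiesFile", the <firstName> and <lastName> declarations, and the sample content are assumptions; only "nameEntity" and the file name come from the text above.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE name [
  <!-- Include the file containing the nameEntity declaration -->
  <!ENTITY % entitiesFile SYSTEM "entitiesFile.txt">
  %entitiesFile;

  <!-- Expand nameEntity, which declares the <name> element -->
  %nameEntity;

  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>
]>
<name>
  <firstName>Pat</firstName>
  <lastName>Lee</lastName>
</name>
```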
Defining attributes
In addition to containing content, elements can also have attributes (see "Understanding XML" in this chapter). There is disagreement about the role of attributes, but for the purpose of this discussion, we'll assume that an attribute should contain information about an element that is important to the XML processor, but is not part of the content of the XML file itself.
For example, say you're using XML to maintain a list of books for display on a Web site. The list can be displayed in two ways: as a full list, or as a list of all of the books that have been added to the list in the past 10 days. In order to make this work, the XML document needs to indicate the date on which each book is entered.
You could add a <dateEntered> subtag to the definition of the <book> tag, but the date on which a given book is entered into your system isn't really a bit of information about the book itself, so you might choose instead to create an attribute named "dateEntered."
The syntax for attribute declarations is as follows:
<!ATTLIST elementName attributeName AttributeType DefaultValue>
So, to give the <book> element a "dateEntered" attribute with a default value of "01/01/2000," we would add the following line to the DTD:
<!ATTLIST book dateEntered CDATA "01/01/2000">
Then to use this attribute in a <book> element, we would simply use an attribute-value pair, like so:
<book dateEntered="11/11/1998">
This attribute would give the XML processor the information necessary to display books based on their entry date.
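The declaration and its use can be sketched together in one small document. The <bookList> root element and the book titles are hypothetical; the attribute declaration is the one shown above.

```xml
<?xml version="1.0"?>
<!DOCTYPE bookList [
  <!ELEMENT bookList (book+)>
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book dateEntered CDATA "01/01/2000">
]>
<bookList>
  <!-- An explicit value overrides the declared default -->
  <book dateEntered="11/11/1998">A Hypothetical Title</book>
  <!-- Attribute omitted: the processor supplies the default, "01/01/2000" -->
  <book>Another Hypothetical Title</book>
</bookList>
```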
Required, implied, and fixed attributes
Each attribute may be required, implied, or fixed. A required attribute default specifies that the element must contain this attribute. For example, the following attribute declaration specifies that each <book> element must have a "dateEntered" attribute:
<!ATTLIST book dateEntered CDATA #REQUIRED>
An implied attribute default indicates that the element may or may not contain this attribute, at the XML author's discretion. For example, the following attribute declaration specifies that each <book> may or may not contain a "dateEntered" attribute:
<!ATTLIST book dateEntered CDATA #IMPLIED>
A fixed attribute value indicates that the attribute must contain an exact value for each element. For example, the following attribute declaration specifies that every <book> must have a "dateEntered" value equal to "11/11/1998":
<!ATTLIST book dateEntered CDATA #FIXED "11/11/1998">
In this example, the XML processor will assume that every <book> element has a "dateEntered" attribute set to "11/11/1998," even if the attribute is omitted.
If an attribute declaration has a default value, but does not specify #REQUIRED, #IMPLIED, or #FIXED, the XML processor will assume the default value for the attribute whenever the attribute is omitted.
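The following sketch places the three kinds of default side by side in a single attribute list. The "enteredBy" and "catalog" attributes, the <bookList> root element, and the sample values are hypothetical and serve only to show each keyword at work.

```xml
<?xml version="1.0"?>
<!DOCTYPE bookList [
  <!ELEMENT bookList (book+)>
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
    dateEntered CDATA #REQUIRED
    enteredBy   CDATA #IMPLIED
    catalog     CDATA #FIXED "main">
]>
<bookList>
  <!-- dateEntered must be present; enteredBy is optional -->
  <book dateEntered="11/11/1998" enteredBy="staff">A Hypothetical Title</book>
  <!-- catalog is omitted here, so the processor assumes catalog="main" -->
  <book dateEntered="12/01/1998">Another Hypothetical Title</book>
</bookList>
```

A validating parser would reject a <book> with no "dateEntered" attribute, and would also reject one whose "catalog" attribute is set to anything other than "main".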
Attribute types
The CDATA keyword in our sample attribute declaration indicates that we want this attribute to contain character data. CDATA is only one option for attribute type, however. The full list follows.
An ID attribute must have a declared default of #IMPLIED or #REQUIRED. No element may have more than one ID attribute.
No element may have more than one NOTATION attribute.
Here, instead of creating one element for both images and movies, we create two separate elements, <image> and <movie>. For each of these elements, the DTD specifies two different applications that might be used to view the file. The determination of which application to use is made in each individual <element> tag in the XML body.
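An arrangement along these lines might be sketched as follows. Apart from Adobe Photoshop, every notation and application name here is hypothetical, as are the <mediaList> root element, the "viewer" attribute, and the movie URL. The elements carry their URLs as content because a NOTATION attribute may not be declared on an element declared EMPTY.

```xml
<?xml version="1.0"?>
<!DOCTYPE mediaList [
  <!-- All notation and application names other than Adobe Photoshop are hypothetical -->
  <!NOTATION photoshop   SYSTEM "Adobe Photoshop">
  <!NOTATION imageViewer SYSTEM "Hypothetical Image Viewer">
  <!NOTATION moviePlayer SYSTEM "Hypothetical Movie Player">
  <!NOTATION movieEditor SYSTEM "Hypothetical Movie Editor">

  <!ELEMENT mediaList (image | movie)*>
  <!-- Each element carries the URL of its file as content -->
  <!ELEMENT image (#PCDATA)>
  <!ELEMENT movie (#PCDATA)>

  <!-- Each element allows two possible applications -->
  <!ATTLIST image viewer NOTATION (photoshop | imageViewer) #REQUIRED>
  <!ATTLIST movie viewer NOTATION (moviePlayer | movieEditor) #REQUIRED>
]>
<mediaList>
  <image viewer="photoshop">http://www.quark.com/picture.jpg</image>
  <movie viewer="moviePlayer">http://www.example.com/movie.mov</movie>
</mediaList>
```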
The xml:lang attribute
The "xml:lang" attribute lets you specify which language is used in an element. This attribute should contain one of the following:
* A two-letter language code defined by ISO 639, optionally followed by a hyphen and a subtype (typically a country code)
* An IANA-registered language identifier, prefixed with "i-" or "I-"
* A user-defined language code, prefixed with "x-" or "X-"
Note that these attributes are not predefined; you must declare them before you use them.
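A minimal sketch of such a declaration and its use follows. The <greeting> element and its content are hypothetical, and declaring the attribute as CDATA is one reasonable choice rather than a requirement.

```xml
<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
  <!-- xml:lang must be declared like any other attribute -->
  <!ATTLIST greeting xml:lang CDATA #IMPLIED>
]>
<!-- "en" is the ISO 639 code for English; "en-GB" would add a subtype -->
<greeting xml:lang="en">Hello</greeting>
```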
To indicate the language you want, simply assign that language's code. For example, the following DTD specifies an "xml:lang" attribute, and the element in the XML body specifies the English language using ISO 639:
<!-- In the DTD -->
You can specify language subtypes by adding a hyphenated extension to the language name. For example, the following element specifies International English (used in Great Britain), as opposed to U.S. English:
<!-- In the XML body -->
The xml:space attribute
The "xml:space" attribute lets you indicate to the application that processes the XML that it should leave all white space for an element and its children as is (unless one of the element's children resets the attribute). For example, the following DTD specifies an "xml:space" attribute, and the element in the XML body sets that attribute to "preserve" for that element and its children:
<!-- In the DTD -->
IGNORE and INCLUDE
You can use the <![IGNORE[]]> tag to tell the XML parser to ignore a stretch of text in an external DTD. Take for example the following:
<!-- This element declaration is parsed as usual: -->
You can tell the XML parser to parse the text within the tags by simply changing the IGNORE to an INCLUDE, as follows:
<!-- This element declaration is parsed as usual: -->
Using public DTDs
As we mentioned earlier, you can refer to an external DTD in an XML document's DOCTYPE declaration, like so:
<?xml version="1.0" standalone="no"?>
If you are using a DTD that has been approved by a body such as the International Organization for Standardization (ISO), you can use a PUBLIC entity reference that specifies the name of a publicly available copy of the DTD. When you do this, you must also supply the URL of a SYSTEM DTD file, so there's something to fall back on if the PUBLIC copy of the DTD is unavailable.
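As a sketch of the PUBLIC form, a DOCTYPE declaration might pair a public name with a fallback URL as shown here. Both the public identifier and the URL are hypothetical, as are the <book> and <title> elements.

```xml
<?xml version="1.0" standalone="no"?>
<!-- Both the public identifier and the fallback URL are hypothetical -->
<!DOCTYPE book PUBLIC "-//Example//DTD Fictional Book//EN"
                      "http://www.example.com/dtds/book.dtd">
<book>
  <title>A Hypothetical Title</title>
</book>
```

The plain SYSTEM form drops the public name and keeps only the URL, as in <!DOCTYPE book SYSTEM "http://www.example.com/dtds/book.dtd">.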
<?xml version="1.0" standalone="no"?>
Combining DTDs to create composite DTDs
Sometimes you may create separate DTDs to define different parts of a document. For example, your organization may use one DTD for all of its XML files' header and footer information, but different DTDs for the body of documents produced in different parts of the company. You can accommodate situations such as this one by simply creating a single new DTD that includes the various DTDs you need and specifies an order for their root elements, like so:
<!ENTITY % standardHeader SYSTEM "standardHeader.dtd">
For documents created with this DTD, <QAReptDoc> would be the root element, and <standardHeader>, <QARept>, and <standardFooter> would be its immediate subelements. A document that uses this DTD might look something like this:
<?xml version="1.0" standalone="no"?>
Making local modifications to imported DTDs
Some workflows may involve DTDs that are almost identical for a group of uses, but which require small adjustments to work in any particular department or group. This is easy to arrange; you simply include the DTD in the DOCTYPE declaration, then add any necessary markup declarations to the internal subset.
You cannot redefine an element that is already defined in the external DTD, but you can redefine entities and default values for attributes.
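A local modification of this kind might be sketched as follows. The file name "techNote.dtd", the "department" entity, and the "status" attribute are all hypothetical, and the sketch assumes that the external DTD declares a <techNote> element that can contain text.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE techNote SYSTEM "techNote.dtd" [
  <!-- Local entity definition; it takes precedence over a declaration
       of the same entity in techNote.dtd -->
  <!ENTITY department "Quality Assurance">
  <!-- Local default value for a hypothetical "status" attribute -->
  <!ATTLIST techNote status CDATA "draft">
]>
<techNote>Prepared by &department;.</techNote>
```

This works because the internal subset is read before the external DTD, and the first declaration of an entity or attribute is the one the XML processor uses.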
Validating an XML file against a DTD
If you're writing your XML documents with a word processor, you can read through the corresponding DTD and make sure that you follow the rules. But you won't really know for sure whether you did until you validate the XML document against the DTD using a program called a validating parser. The validating parser reads the DTD and then checks your XML file to make sure it adheres to the DTD's rules. A good validating parser should also tell you what problems it finds (if any).
Remember that if you want to check an XML document for adherence to a particular DTD, you need a validating XML parser, not just a plain XML parser. There are many XML parsers that will tell you if an XML file is well-formed, but considerably fewer that will tell you if an XML file is valid.
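The difference between the two checks can be seen in a sketch such as the following, in which all element names are hypothetical. A plain XML parser would accept this document because it is well-formed, but a validating parser should report that it does not follow its DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
]>
<!-- Well-formed: every tag is closed and properly nested. -->
<!-- Not valid: the DTD requires an <author> after <title>, and <publisher> is undeclared. -->
<book>
  <title>A Hypothetical Title</title>
  <publisher>A Hypothetical Press</publisher>
</book>
```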
For a quick reference to DTD features and conventions, see Appendix B, "DTD Quick Reference," in Chapter 7, "Appendices."
Should you develop a new DTD, custom-designed to fit the needs of your organization? Or should you use an industry-standard DTD that will save you development time and help to ensure that you can exchange information with other organizations in your industry?
There are advantages to both approaches. If you create your own DTD from scratch, you have total control over the structure of that DTD and the process of updating it. However, you're also looking at a significant investment of time and effort, and you must be very careful to consider the needs of everyone who will be using that DTD. If you use an industry-standard DTD, you don't have to go through the DTD development process, but you have to follow the DTD's conventions and adhere to the structure it defines.
Pros and cons of using industry-standard DTDs
If you plan to exchange information with other organizations, an industry-standard DTD might be a good idea. Using an industry-standard DTD can help ensure that information exchange goes smoothly, and that the information you tag can be reused in other contexts. Indeed, this is one of the reasons XML was developed: To help standardize the formats in which information is stored and exchanged.
But using an industry-standard DTD can present its own challenges, because two organizations may have very different needs, even if the data they work with is essentially the same. Industry-standard DTDs can be modified for use within an organization, but that partially defeats their purpose, which is to ensure that information is stored in a consistent format between organizations.
Can I use an industry-standard DTD?
Whether you can use an industry-standard DTD depends on a number of factors.
Does an industry-standard DTD exist for your industry?
To find out the answer to this question, you can look for industry-standard DTDs on the World Wide Web. Two good places to look are www.schema.net and www.xml.org.
If an industry-standard DTD exists, does it meet your needs?
Think carefully about this question; if the DTD you choose does not meet your needs, the cumulative effect of any shortcomings will probably grow with time.
If no industry-standard DTD exists for your industry, is one in development?
If you can't find an industry-standard DTD that fits your organizational needs, you might want to find out if anyone else in your industry is developing one. If so, your organization may have a chance to bring its particular expertise to bear in the development of the DTD. Participation in the development of an industry-standard DTD can help you avoid problems that can result from adopting a DTD developed by someone who doesn't understand your issues.
Extending industry-standard DTDs
Some organizations choose to use an industry-standard DTD, but modify that DTD to make it suit their particular needs. For example, to make the ISO-standard "book" SGML DTD work for them, the University of California Press made a series of adjustments to it, adding elements that let them store information such as chapter subtitles and chapter-specific bylines. The ISO (International Organization for Standardization) provides guidelines for modifying its DTDs, so even if you make such modifications, your new DTD is still somewhat standardized.
What if you need to exchange data with other organizations that use the original, unmodified DTD? Some organizations choose to create utilities that can convert documents that adhere to their modified DTD into documents that adhere to the original form of the DTD. This kind of solution gives you many of the advantages of having a customized DTD, yet still allows you to exchange data with other organizations in the industry.
Avenue.quark lets you use a DTD to extract structured content from QuarkXPress documents and store that content in the file system or in a database. The following section describes how the process works using a sample situation.
The situation
Let's say your organization has created a large number of technical documents in QuarkXPress format, and you'd like to export their content in XML format and store it in a database so you can make it available to your customers on the Web. The technical documents all use the same QuarkXPress template and style sheets.
Step 1: Create or choose a DTD
Before you can extract the technical documents' content in a structured format, you must have a structure to contain that content. The DTD provides that structure.
For more about DTDs, see "Working with DTDs" in this chapter.
There are two ways to acquire a DTD for use with avenue.quark:
Step 2: Create an XML document
Create a new XML document in avenue.quark and specify the DTD you chose in Step 1. Any mandatory elements in the DTD are automatically inserted in the XML document.
XML Workspace palette for a new XML document
Step 3: Create a tagging rule set
One of the unique features of avenue.quark is rule-based tagging. In rule-based tagging, you create a set of tagging rules that tell avenue.quark, for example, that a paragraph that uses the "Headline" style sheet should usually be tagged as a <Title>. You can also use tagging rule sets to specify how particular character style sheets, text colors, and local formatting styles should be tagged. (For more information about tagging rule sets, see Chapter 5, "Tagging Rule Sets.")
Step 4: Save the XML document as a template
Save the XML document as a template named "TechNote.xmt." The template contains the technical document DTD and the tagging rule set you created in Step 3. You can use this template to create as many XML files as you want, on the same computer or on several computers.
Step 5: Open the QuarkXPress document you want to tag
Step 6: Create a new XML document based on the XML template
When you create a new avenue.quark XML document, the first thing you must do is choose a template from the Template list on which to base the new XML document. For this example, we'll use the "TechNote.xmt" from Step 4.
The TechNote.xmt template makes it easy to tag a QuarkXPress document.
Step 7: Perform rule-based tagging
To perform rule-based tagging, Command+drag (Mac OS) or Ctrl+drag (Windows) the box containing the technical document to the <techNote> element in the XML Tree scroll list. Avenue.quark automatically tags the document using the rules in the tagging rule set.
To use rule-based tagging, simply Command+drag (Mac OS) or Ctrl+drag (Windows) the box to the appropriate element in the XML Tree list. Avenue.quark uses the tagging rule set to tag as much of the content as it can.
Step 8: Perform any necessary manual tagging
Some of your technical documents may be ready to go after rule-based tagging has been completed. Others, though, may have additional content that needs to be tagged manually, or occurrences of content that could be tagged in more than one way. To resolve such situations, you simply drag the content in question onto the appropriate element in the XML Workspace palette.
Step 9: Use your structured content on the Web and elsewhere
Once your technical documents' content is in XML format, you can use a variety of tools to put it on the Web. For example, you can serve it as straight XML and view it using a newer Web browser such as Microsoft Internet Explorer 5.0. XML-tagged content can also be used in a wide variety of other ways, for everything from electronic information exchange to the generation of printed documents.
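For illustration only, the XML produced for one technical document might resemble the following sketch. Only the <techNote> and <Title> element names come from the steps above; the DTD file name, the <Body> and <Para> elements, and all of the content are hypothetical, and the actual structure depends entirely on the DTD you chose in Step 1.

```xml
<?xml version="1.0" standalone="no"?>
<!-- Hypothetical DTD file name and structure -->
<!DOCTYPE techNote SYSTEM "techNote.dtd">
<techNote>
  <!-- A paragraph that used the "Headline" style sheet, tagged by rule -->
  <Title>A Hypothetical Technical Note</Title>
  <Body>
    <Para>Text extracted from the QuarkXPress document.</Para>
  </Body>
</techNote>
```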